Abstract

We learn multiple hypotheses for related 
tasks under a latent hierarchical relationship 
between tasks. We exploit the intuition that 
for domain adaptation, we wish to share clas- 
sifier structure, but for multitask learning, we 
wish to share covariance structure. Our hi- 
erarchical model is seen to subsume several 
previously proposed multitask learning mod- 
els and performs well on three distinct real- 
world data sets. 



1 INTRODUCTION 

We consider two related, but distinct tasks: do- 
main adaptation (DA) [H [TJ [7] and multitask learning 
(MTL) [5j|2]. Both involve learning related hypothe- 
ses on multiple data sets. In DA, we learn multiple 
classifiers for solving the same problem over data from 
different distributions. In MTL, we learn multiple clas- 
sifiers for solving different problems over data from the 
same distribution.¹ Seen from a Bayesian perspective,
a natural solution is a hierarchical model, with hy- 
potheses as leaves [BJ [THl US]- However, when there 
are more than two hypotheses to be learned (i.e., more 
than two domains or more than two tasks), an imme- 
diate question is: are all hypotheses equally related? 
If not, what is their relationship? We address these is- 
sues by proposing two hierarchical models with latent 
hierarchies, one for DA and one for MTL (the models 
are nearly identical). We treat the hierarchy nonparametrically,
employing Kingman's coalescent [T2]. We derive an EM algorithm that
makes use of recently developed efficient inference algorithms for the
coalescent [T3]. On several DA and MTL problems, we show the efficacy
of our model.

¹ We note that this distinction is not always maintained
in the literature where, often, DA is solved but it is called
MTL. We believe this is valid (DA is a special case of
MTL), but for the purposes of this paper, it is important
to draw the distinction.

Our models for DA and MTL share a common structure based on an unknown
hierarchy. The key difference between the DA model and the MTL model
is in what information is shared across the hierarchy. For simplicity,
we consider the case of linear classifiers (logistic regression and
linear regression). This can be extended to non-linear classifiers by
moving to Gaussian processes [16]. In domain adaptation, a useful
model is to assume that there is a single classifier that "does well"
on all domains [HUH]. In the context of hierarchical Bayesian
modeling, we interpret this as saying that the weight vector
associated with the linear classifier is generated according to the
hierarchical structure. On the other hand, in MTL, one does not expect
the same weight vector to do well for all problems. Instead, a common
assumption is that features co-vary in similar ways between tasks
[T3]. In a hierarchical Bayesian model, we interpret this as saying
that the covariance structure associated with the linear classifiers
is generated according to the hierarchical structure. In brief: for
DA, we share weights; for MTL, we share covariance.

2 BACKGROUND 

2.1 RELATED WORK 

Yu et al. [16] have presented a linear multitask model for domain
adaptation. In the linear multitask model, a shared mean and
covariance is generated by a Normal-Inverse-Wishart prior, and then
the weight vector for each task is generated by a Gaussian conditioned
on this shared mean and covariance. The key idea in the linear
multitask model [16] is to model feature covariance; this is also the
intuition behind the informative priors model [13], carried out in a
more Bayesian framework. (The linear multitask model is almost
identical to the conjoint analysis model [6].)

Figure 1: Variables describing the N-coalescent.



Xue et al. [15] present a Dirichlet process mixture
model formulation, where domains are clustered into 
groups and share a single classifier across groups. This 
helps to prevent "negative transfer" (the effect of 
"unrelated" tasks negatively affecting performance on 
other tasks). Xue et al.'s model is effectively a task- 
clustering model, in which some tasks share common 
structure (those in the same cluster), but are otherwise 
independent from other tasks (those in other clusters) . 
This work was later improved on by Dunson, Xue and 
Carin [5] in the formulation of the matrix stick break- 
ing process: a more flexible approach to Bayesian mul- 
titask learning that allows for more sharing. 

There is also a large body of work on non-Bayesian approaches to
multitask learning and domain adaptation. Bickel et al. [5] offer an
extension of the logistic regression model that simultaneously learns
a good classifier and a classifier to provide instance weights for
out-of-sample data. This approach is only applicable when no labeled
"target" data is available, but much unlabeled target data is.
Blitzer, McDonald and Pereira [3] present another approach to this
"unsupervised" setting of domain adaptation that makes use of prior
knowledge of features that are expected to behave similarly across
domains. Both of these approaches are developed only in the two-domain
setting. Dredze and Crammer [8] describe an online approach for
dealing with the many-domains problem, sharing information across
domains via confidence-weighted classifiers.

2.2 KINGMAN'S COALESCENT 

Our model for DA and MTL makes use of a latent hierarchical structure.
Being Bayesian, we wish to attach a prior distribution to this
hierarchy. A convenient choice of prior is Kingman's coalescent [15].
Our description and notation are borrowed directly from [14].
Kingman's coalescent originated in the study of population genetics
for a set of haploid organisms (organisms which have only a single
parent). The coalescent is a nonparametric model over a countable set
of organisms. It is most easily understood in terms of its finite
dimensional marginal distributions over $N$ individuals, in which case
it is called an $N$-coalescent. We then take the limit
$N \to \infty$. In our case, the $N$ individuals will correspond to
$N$ classifiers (tasks).

The $N$-coalescent considers a population of $N$ organisms at time
$t = 0$ (see Figure 1 for an example with $N = 4$). We follow the
ancestry of these individuals backward in time, where each organism
has exactly one parent at time $t < 0$. The $N$-coalescent is a
continuous-time, partition-valued Markov process which starts with $N$
singleton clusters at time $t = 0$ and evolves backward, coalescing
lineages until there is only one left. We denote by $t_i$ the time at
which the $i$th coalescent event occurs (note $t_i < 0$), and
$\delta_i = t_{i-1} - t_i$ the time between events (note
$\delta_i > 0$). Under the $N$-coalescent, each pair of lineages
merges independently with exponential rate 1; so
$\delta_i \sim \mathrm{Exp}\left(\binom{N-i+1}{2}\right)$. With
probability one, a random draw from the $N$-coalescent is a binary
tree with a single root at $t = -\infty$ and $N$ individuals at time
$t = 0$. We denote by $\pi$ the tree structure and by $\delta$ the
collection of $\{\delta_i\}$. Leaves are denoted by $x_n$ and internal
nodes by $y_i$, where $i$ indexes a coalescent event (see Figure 1).
The marginal distribution over tree topologies is uniform and
independent of the coalescent times, and the model is infinitely
exchangeable. We consider the limit as $N \to \infty$, called the
coalescent.
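The generative description of the $N$-coalescent above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_n_coalescent`, the integer node ids and the returned `(merges, times)` representation are hypothetical choices made here, not notation from the paper.

```python
import random
from math import comb

def sample_n_coalescent(n, rng):
    """Draw (merges, times) from an N-coalescent with pairwise rate 1.

    merges[i] is the pair of lineage ids joined at the (i+1)th event and
    times[i] is the (negative) time of that event. Ids 0..n-1 are the
    leaves; each merge creates a new id n, n+1, ...
    """
    lineages = list(range(n))
    t = 0.0
    merges, times = [], []
    next_id = n
    while len(lineages) > 1:
        # With m lineages left, the waiting time is Exp(m choose 2).
        rate = comb(len(lineages), 2)
        t -= rng.expovariate(rate)
        # Every pair is equally likely to be the one that coalesces.
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)
        merges.append((a, b))
        times.append(t)
        next_id += 1
    return merges, times

merges, times = sample_n_coalescent(4, random.Random(0))
```

Each draw yields $N - 1$ merge events with strictly decreasing times, matching the binary-tree structure described above.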

Once the tree structure is obtained, one can define an additional
Markov process to evolve over the tree. One common, and easy to
understand, choice is a Brownian diffusion process. In Brownian
diffusion in $D$ dimensions, we assume an underlying diffusion
covariance of $\Lambda \in \mathbb{R}^{D \times D}$, positive
semi-definite. The root is a $D$-dimensional vector $z$. Each non-root
node $y_i \in \mathbb{R}^D$ is drawn
$y_i \sim \mathcal{N}or(y_{p(i)}, \delta_i \Lambda)$, where $p(i)$ is
the parent of $i$ in the tree; that is, each node is drawn conditioned
on its parent.

The coalescent is a very popular model in population genetics (it
corresponds to a limiting case of the Wright-Fisher model), but has
been plagued by a lack of efficient inference algorithms. (Most
inference occurs by Metropolis-Hastings sampling over tree
structures.) Recently, Teh et al. [2] proposed a collection of
efficient bottom-up agglomerative inference algorithms for the
coalescent. The one we use is called Greedy-Rate1 and proceeds in a
greedy manner, merging nodes that want to coalesce most quickly. In
the case of Greedy-Rate1, the exponential rate is fixed as 1. Belief
propagation is used to marginalize out internal nodes $y_i$. If we
associate with each node in the tree a mean $\hat{y}$ and variance $v$
message, we can compute messages as Eq (1), where $i$ is the current
node and $li$ and $ri$ are its children.

$$v_i = \left[(v_{li} + (t_{li} - t_i)\Lambda)^{-1} + (v_{ri} + (t_{ri} - t_i)\Lambda)^{-1}\right]^{-1}$$
$$\hat{y}_i = \left[\hat{y}_{li}(v_{li} + (t_{li} - t_i)\Lambda)^{-1} + \hat{y}_{ri}(v_{ri} + (t_{ri} - t_i)\Lambda)^{-1}\right] v_i \qquad (1)$$
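As a small worked instance of Eq (1), the sketch below evaluates the messages in the one-dimensional case, where every variance and $\Lambda$ itself are scalars. The function name `merge_messages` and its argument names are hypothetical and chosen only for this illustration.

```python
def merge_messages(y_l, v_l, t_l, y_r, v_r, t_r, t_i, lam):
    """Eq (1) for one-dimensional nodes: all arguments are floats.

    (y_l, v_l, t_l) and (y_r, v_r, t_r) are the mean, variance and time
    of the left and right children; t_i < min(t_l, t_r) is the parent's
    time and lam is the scalar diffusion variance (Lambda when D = 1).
    """
    prec_l = 1.0 / (v_l + (t_l - t_i) * lam)
    prec_r = 1.0 / (v_r + (t_r - t_i) * lam)
    v_i = 1.0 / (prec_l + prec_r)
    y_i = (y_l * prec_l + y_r * prec_r) * v_i
    return y_i, v_i

# Two exactly observed leaves at time 0 (values 1 and 3), parent at t = -1.
y_i, v_i = merge_messages(1.0, 0.0, 0.0, 3.0, 0.0, 0.0, -1.0, 1.0)
# y_i = 2.0, v_i = 0.5
```

With equal branch lengths the parent mean is the average of the two children, and the parent variance is half the per-branch variance, as one would expect from combining two independent Gaussian observations.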



Importantly, this model is applicable when the $x_n$s are
not known entirely, but are represented by Gaussians.
This can be done efficiently since, given a hierarchical
structure, inference is simply message passing in a
Gaussian random field. (We will need this property
in order to perform expectation-maximization.)

3 LATENT HIERARCHY MODELS 

In this section, we present a model for domain adaptation (DA) and a
model for multitask learning (MTL), plus some minor variants. (The
variants are evaluated in Section 4.) As mentioned previously, the
structure of the two models is the same: they differ in what
information is shared.

To fix notation, suppose that we wish to learn $K$ different
hypotheses ($K$ domains in DA or $K$ tasks in MTL). We suppose that we
have training data for each hypothesis, with $N_k$ labeled examples
for hypothesis $k$. (Notational confusion warning: in reference to the
coalescent, the $K$ hypotheses will be the leaves of the coalescent
tree, so this is more akin to a $K$-coalescent.) The inputs are drawn
from $\mathbb{R}^D$ and outputs from $\mathcal{Y}$, where
$\mathcal{Y} = \mathbb{R}$ for regression tasks or
$\mathcal{Y} = \{-1, +1\}$ for classification tasks. We assume a
distribution $\mathcal{D}^{(k)}$ over $\mathbb{R}^D$ for each
hypothesis (in MTL, we assume identical distributions
$\mathcal{D}^{(k)} = \mathcal{D}$). Our data thus has the form
$\{\{(x_n^{(k)}, y_n^{(k)}) : n \in [N_k]\} : k \in [K]\}$, where
$[I] = \{1, \dots, I\}$, $x_n^{(k)}$ is the $n$th input for task $k$
and $y_n^{(k)}$ is the corresponding label. Each
$x_n^{(k)} \sim \mathcal{D}^{(k)}$ iid. We will be using linear or
logistic regression, parameterized by hypothesis-specific weight
vectors $w^{(k)} \in \mathbb{R}^D$, where predictions are made on the
basis of $w^{(k)\top} x_n^{(k)}$.

One important design choice in both our models is 
whether we explicitly model the input x. In the cases 
where we do not, our model is a conditional model of 
the form p(y | x). In the cases where we do, our model
is a joint model that factorizes as p(y | x)p(x). In this 
case, the same tree structure is used to model both 
the conditional likelihood of y given x and the data 
itself. In effect, this gives more data on which to learn 
the tree structure, at the cost that it might not be 
directly related to the prediction problem. We refer to 
this choice in the future as "model the data." 

3.1 DOMAIN ADAPTATION 

We propose the following model for domain adaptation. The basic idea
is to generate a tree structure according to a $K$-coalescent and then
propagate weight vectors along this tree. The root of the tree
corresponds to the "global" weight vector and the leaves correspond to
the task-specific weight vectors. We assume the weight vectors evolve
according to Brownian diffusion. Our generative story is:

1. Choose a global mean and covariance
   $(\mu^{(0)}, \Lambda) \sim \mathcal{N}or\mathcal{IW}(0, \sigma^2 I, D + 1)$.²

2. Choose a tree structure $(\pi, \delta) \sim$ Coalescent over
   $K$ leaves.

3. For each non-root node $i$ in $\pi$ (top-down):

   (a) Choose $\mu^{(i)} \sim \mathcal{N}or(\mu^{(p_\pi(i))}, \delta_i \Lambda)$,
       where $p_\pi(i)$ is the parent of $i$ in $\pi$.

4. For each domain $k \in [K]$:

   (a) Denote by $w^{(k)} = \mu^{(i)}$ where $i$ is the leaf in
       $\pi$ corresponding to domain $k$.

   (b) For each example $n \in [N_k]$:

       i. Choose input $x_n^{(k)} \sim \mathcal{D}^{(k)}$.

       ii. Choose output $y_n^{(k)}$ by:
           Regression: $\mathcal{N}or(w^{(k)\top} x_n^{(k)}, \rho^2)$
           Classification: $\mathcal{B}in(1/(1 + \exp(-w^{(k)\top} x_n^{(k)})))$

Here, $\rho^2$ and $\sigma^2$ are hyperparameters that we assume
are known (we use held-out data to set them).
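Steps 3 and 4 of the generative story can be sketched as follows. This is a simplified illustration, not the full model: it assumes a fixed, hand-written tree rather than a coalescent draw, a diagonal $\Lambda$, and a root mean supplied directly instead of drawn from the Normal-Inverse-Wishart prior. The names `sample_da_weights`, `sample_label`, `parent`, `delta` and `lam_diag` are hypothetical.

```python
import math
import random

def sample_da_weights(parent, delta, lam_diag, mu_root, rng):
    """Step 3: diffuse weight vectors from the root down a fixed tree.

    parent[i] is the parent of node i (None for the root); nodes are
    numbered so that a parent always precedes its children. delta[i] is
    the branch length above node i and lam_diag the diagonal of Lambda.
    """
    mu = {}
    for i, p in enumerate(parent):
        if p is None:
            mu[i] = list(mu_root)
        else:
            mu[i] = [rng.gauss(m, math.sqrt(delta[i] * lam))
                     for m, lam in zip(mu[p], lam_diag)]
    return mu

def sample_label(w, x, rng):
    """Step 4(b)ii, classification: +1 with probability 1/(1+exp(-w.x))."""
    score = sum(wd * xd for wd, xd in zip(w, x))
    return 1 if rng.random() < 1.0 / (1.0 + math.exp(-score)) else -1

rng = random.Random(0)
# Hypothetical tree: node 0 is the root, node 1 an internal node,
# nodes 2 and 3 its leaf children, node 4 a leaf child of the root.
parent = [None, 0, 1, 1, 0]
delta = [0.0, 0.5, 0.5, 0.5, 1.0]
mu = sample_da_weights(parent, delta, [1.0, 1.0], [0.0, 0.0], rng)
y = sample_label(mu[2], [1.0, -1.0], rng)
```

In this sketch, leaves 2 and 3 share the internal node 1 as a parent, so their weight vectors tend to be closer to each other than to leaf 4, which is the kind of sharing the tree is meant to encode.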

We consider the following variants of this model: Is $\Lambda$
assumed diagonal or full? Do we explicitly model
the data? We call these:

Diag     Diagonal $\Lambda$, do not model the data.
Diag+X   Diagonal $\Lambda$, do model the data.
Full     Full $\Lambda$, do not model the data.
Full+X   Full $\Lambda$, do model the data.

In the case where the input data is modeled explicitly (i.e., Diag+X
and Full+X), we assume a base parameter vector over X generated at the
root (in step (1)), propagated down the tree (in step (3)) and used to
generate the inputs $x_n^{(k)}$ (in step (4.b.i)). In the case that
the input is modeled, we always assume diagonal covariance on the
input. For continuous data, we use a Gaussian mutation kernel, as in
step 4.a. For discrete data, we use a multinomial equilibrium
distribution $q_d$ and transition rate matrix
$Q_d = \Lambda_{d,d}(q_d^\top 1_K - I)$, where $1_K$ is a vector of
$K$ ones, while the transition probability matrix for entry $d$ in a
time interval of length $\delta$ is
$e^{Q_d \delta} = e^{-\delta\Lambda_{d,d}} I + (1 - e^{-\delta\Lambda_{d,d}})\, q_d^\top 1_K$.
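The closed form for the discrete transition probabilities can be checked numerically. The sketch below builds the matrix entry by entry, reading the rank-one term as "every row equals the equilibrium distribution $q_d$" (an interpretation adopted here so that rows are probability distributions). The function name `transition_matrix` and the example values are hypothetical.

```python
import math

def transition_matrix(q, rate, delta):
    """Closed-form transition probabilities over a branch of length delta.

    q is the equilibrium distribution over the values of one discrete
    feature and rate is that feature's mutation rate (Lambda_{d,d}).
    Entry [a][b] is the probability of moving from value a to value b.
    """
    stay = math.exp(-delta * rate)
    k = len(q)
    return [[stay * (1.0 if a == b else 0.0) + (1.0 - stay) * q[b]
             for b in range(k)] for a in range(k)]

P = transition_matrix([0.5, 0.3, 0.2], rate=2.0, delta=0.25)
```

For a branch of length zero the matrix is the identity, and for a very long branch every row approaches the equilibrium distribution, which is the behavior the formula is meant to capture.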

² We denote by $\mathcal{N}or\mathcal{IW}(\mu, \Lambda \mid m, \Psi, \nu)$
the Normal-Inverse-Wishart distribution with prior mean $m$, prior
covariance $\Psi$ and $\nu$ degrees of freedom.



3.2 MULTITASK LEARNING

In the multitask learning case, we no longer wish to share the weight
vectors, but rather wish to share their covariance structure. This
model is slightly more difficult to specify because Brownian motion no
longer makes sense over a covariance structure (for instance, it will
not maintain positive semi-definiteness). Our solution to this problem
is to decompose the covariance structure into correlations and
standard deviations. We assume a constant, global correlation matrix
and only allow the standard deviations to evolve over the tree. (The
idea of decomposing the covariance comes from [TT], section 19.2.) We
model the log standard deviations using Brownian diffusion.

In particular, our model assumes that each node in the tree is
associated with a diagonal log standard deviation matrix
$S^{(i)} \in \mathbb{R}^{D \times D}$. The weight vector for task $k$
is then drawn Gaussian with zero mean and covariance given by
$(\exp S^{(i)}) R (\exp S^{(i)})$, where
$R \in \mathbb{R}^{D \times D}$ are the shared correlations (with
diagonal elements equal to 1). Our prior on $R$ is:

$$p(R) \propto (\det R)^{\frac{1}{2}D(D-1)-1} \prod_{i=1}^{D} (\det R_{(ii)})^{-(D+1)/2} \qquad (2)$$

Here, $R_{(ii)}$ is the $i$th principal submatrix of $R$. This is the
marginal distribution of $R$ when $SRS$ has an inverse-Wishart prior
with identity prior covariance and $D + 1$ degrees of freedom, which
leads to uniform marginals for each pairwise correlation.

Given this setup, our multitask learning model has the
following generative story:

→ 1. Choose $R$ by Eq (2) and deviation covariance
     $\Lambda \sim \mathcal{IW}(\sigma^2 I, D + 1)$.

  2. Choose a tree structure $(\pi, \delta) \sim$ Coalescent over
     $K$ leaves.

  3. For each non-root node $i$ in $\pi$ (top-down):

→    (a) Choose $S^{(i)} \sim \mathcal{N}or(S^{(p_\pi(i))}, \delta_i \Lambda)$,
         where $p_\pi(i)$ is the parent of $i$ in $\pi$.

  4. For each task $k \in [K]$:

→    (a) Choose $w^{(k)}$ by ($i$ is the leaf associated with
         task $k$): $\mathcal{N}or(0, (\exp S^{(i)}) R (\exp S^{(i)}))$

     (b) For each example $n \in [N_k]$:

→        i. Choose input $x_n^{(k)} \sim \mathcal{D}$.

         ii. Choose output $y_n^{(k)}$ by:
             Regression: $\mathcal{N}or(w^{(k)\top} x_n^{(k)}, \rho^2)$
             Classification: $\mathcal{B}in(1/(1 + \exp(-w^{(k)\top} x_n^{(k)})))$

The steps that differ from the domain adaptation
model are marked with an arrow (→).

3.3 INFERENCE

For both the DA and MTL models, we perform inference using an
expectation-maximization algorithm. The latent variables in both
algorithms are the variables associated with the leaves of the trees
(in DA: the weight vectors; in MTL: the log standard deviations). The
parameters are everything else: the tree structure $\pi$ and times
$\delta$, the Brownian covariance $\Lambda$ and all other prior
parameters.

3.3.1 Domain adaptation 

We begin with the domain adaptation model. For sim- 
plicity, we consider the case where the input data is not 
modeled. In the E-step, we compute expectations over 
the leaves (classifiers). In the M-step, we optimize the 
tree structure and the other hyperparameters. 

E-step: The E-step can be performed exactly in the 
case of regression (the expectations of the classifiers 
are simply Gaussian). In the case of classification, we 
approximate the expectations by Gaussians (via the 
Laplace approximation). In particular, for each do- 
main k, we compute: 



$$\hat{w}^{(k)} = \arg\max_{w}\; p(w) \prod_{n=1}^{N_k} p(y_n^{(k)} \mid x_n^{(k)}, w) \qquad (3)$$

$$C^{(k)} = \left(X^{(k)\top} A^{(k)} X^{(k)}\right)^{-1} + (\delta\Lambda)^{-1} \qquad (4)$$

In Eq (3), $p(w)$ is the prior on $w$ given by its parent in the tree;
the likelihood term is the data likelihood (logistic for
classification, or Gaussian for regression). We solve the optimization
problem by conjugate gradient. $\hat{w}^{(k)}$ is the mean of the
Gaussian representing the expectation of the $k$th weight vector. The
covariance of the estimate is $C^{(k)}$, with $A^{(k)}$ diagonal. For
regression, $A^{(k)} = I$; for classification, $A^{(k)}$ has entries
$A^{(k)}_{n,n} = s_n^{(k)}(1 - s_n^{(k)})$, where
$s_n^{(k)} = 1/(1 + \exp(-\hat{w}^{(k)\top} x_n^{(k)}))$.
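The classification E-step can be illustrated with a small sketch that finds the maximizer in Eq (3) and the diagonal entries of $A^{(k)}$. Two simplifications are assumed here and are not from the paper: plain gradient ascent stands in for conjugate gradient, and the prior covariance $\delta\Lambda$ is taken to be diagonal. The names `map_weights` and `laplace_weights` and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def map_weights(X, y, prior_mean, prior_var, steps=2000, lr=0.1):
    """Maximize log p(w) + sum_n log p(y_n | x_n, w) by gradient ascent.

    X is a list of feature lists, y a list of labels in {-1, +1}, and the
    prior on w is Gaussian with mean prior_mean and per-dimension
    variance prior_var (the diagonal of delta * Lambda).
    """
    w = list(prior_mean)
    for _ in range(steps):
        # Gradient of the Gaussian log prior.
        grad = [-(wd - md) / vd
                for wd, md, vd in zip(w, prior_mean, prior_var)]
        # Gradient of the logistic log likelihood.
        for x, label in zip(X, y):
            margin = label * sum(wd * xd for wd, xd in zip(w, x))
            coeff = label * (1.0 - sigmoid(margin))
            for d, xd in enumerate(x):
                grad[d] += coeff * xd
        w = [wd + lr * gd for wd, gd in zip(w, grad)]
    return w

def laplace_weights(X, w):
    """Diagonal entries s_n (1 - s_n) of A^(k) at the fitted weights."""
    s = [sigmoid(sum(wd * xd for wd, xd in zip(w, x))) for x in X]
    return [sn * (1.0 - sn) for sn in s]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
y = [1, -1, 1, -1]
w_hat = map_weights(X, y, prior_mean=[0.0, 0.0], prior_var=[1.0, 1.0])
A_diag = laplace_weights(X, w_hat)
```

With no data the maximizer stays at the prior mean, which is the sense in which the parent in the tree acts as a prior on each leaf classifier.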

M-step: Here, we optimize $(\pi, \delta)$ by integrating out the
$\mu$s associated with internal nodes (using belief propagation). This
can be done efficiently using the Greedy-Rate1 algorithm [14].
Optimize $\Lambda$ as the mode of an Inverse-Wishart with
$D + K + 1$ degrees of freedom and mean $\Sigma$:

$$\Sigma = I + \sum_i \Delta_i^\top \left(v^{(l_\pi(i))} + v^{(r_\pi(i))} + t^{(i)}\Lambda\right)^{-1} \Delta_i \qquad (5)$$

$$\Delta_i = \mu^{(l_\pi(i))} - \mu^{(r_\pi(i))}, \qquad t^{(i)} = \delta^{(l_\pi(i))} + \delta^{(r_\pi(i))} \qquad (6)$$

Here, $l_\pi(i)$ and $r_\pi(i)$ are the left and right children
respectively of node $i$ in $\pi$. $v^{(i)}$ is the variance of node
$i$ (obtained by Eq (4) for leaves or via belief propagation for
internal nodes). The sum in Eq (5) ranges over all non-leaf nodes in
$\pi$.

We initialize EM by computing $\hat{w}^{(k)}$ for each task according
to a maximum a posteriori estimate with zero mean and $\sigma^2 I$
variance. This initialization effectively assumes no shared structure.

3.3.2 Multitask learning 

Constructing an exact EM algorithm for the multi- 
task learning model is significantly more complex. The 
complexity arises from the convolution of the Normal 
(over w) with the log-Normal (over S). This makes 
the computation of exact expectations (over S) in- 
tractable. We therefore use the popular "hard EM" 
approximation, in which we estimate the expectation 
of the latent variables (S) with a point mass centered 
at their mode. (Experiments in the domain adapta- 
tion model show that the hard EM approximation to 
w does not affect results.) 

The only additional complication is that of optimizing $R$ (the
overall correlations) and each $S^{(i)}$ (the per-node standard
deviations). $R$ can be handled exactly as $\Lambda$ in the domain
adaptation case: see Eq (5), but constrained to have ones along the
diagonal. The case for $S^{(i)}$ is slightly more involved. We first
maximize $w$ as before, and then also maximize $S$. The log posterior
and its derivative have the forms below, where $C$ is a constant
independent of $S$ and $W = \mathrm{diag}\, w$:

$$\log p(S) = -\mathrm{tr}\, S - \tfrac{1}{2} \mathrm{tr}\left[(S - P)^\top \Lambda^{-1} (S - P)\right] - \tfrac{1}{2} \mathrm{tr}\left[W (e^{-S} R^{-1} e^{-S}) W\right] + C$$

$$\nabla_S \log p(S) = -I - (S - P)\Lambda^{-1} + W (e^{-S} R^{-1} e^{-S}) W$$

Here $P$ is the (diagonal) matrix at the parent of the current node in
the hierarchy. We optimize $S$ by gradient descent with step size
$(0.1/\mathit{iter})$ until convergence of $S$ to $10^{-6}$.
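The optimization of $S$ can be sketched as follows. This sketch works with the vector $s$ of diagonal entries of $S$ and writes the data term as $-\frac{1}{2} u^\top R^{-1} u$ with $u_d = w_d e^{-s_d}$, i.e. the Gaussian log density of $w$ under covariance $(\exp S) R (\exp S)$; this vector form is an assumption made here for a self-contained example and is not identical to the trace notation above. The steps move uphill on the log posterior with the $0.1/\mathit{iter}$ schedule. All function names and the toy values are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def log_post(s, p, w, lam_inv, r_inv):
    """Log posterior of the log standard deviations s, up to a constant."""
    diff = [a - b for a, b in zip(s, p)]
    u = [wd * math.exp(-sd) for wd, sd in zip(w, s)]
    prior = sum(d * q for d, q in zip(diff, matvec(lam_inv, diff)))
    fit = sum(a * b for a, b in zip(u, matvec(r_inv, u)))
    return -sum(s) - 0.5 * prior - 0.5 * fit

def grad_log_post(s, p, w, lam_inv, r_inv):
    """Gradient of log_post with respect to each entry of s."""
    diff = [a - b for a, b in zip(s, p)]
    u = [wd * math.exp(-sd) for wd, sd in zip(w, s)]
    pull = matvec(lam_inv, diff)   # prior term pulls s toward the parent p
    push = matvec(r_inv, u)        # data term from the weight vector w
    return [-1.0 - pl + ud * ps for pl, ud, ps in zip(pull, u, push)]

def optimize_s(p, w, lam_inv, r_inv, tol=1e-6, max_iter=20000):
    """Gradient steps with size 0.1/iter, started at the parent value p."""
    s = list(p)
    for it in range(1, max_iter + 1):
        g = grad_log_post(s, p, w, lam_inv, r_inv)
        new_s = [sd + (0.1 / it) * gd for sd, gd in zip(s, g)]
        change = max(abs(a - b) for a, b in zip(new_s, s))
        s = new_s
        if change < tol:
            break
    return s

p = [0.0, 0.0]                                 # parent log std deviations
w = [1.0, 0.5]                                 # current weight vector
lam_inv = [[1.0, 0.0], [0.0, 1.0]]             # inverse of Lambda
r_inv = [[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]     # inverse of R with rho = 0.5
s_hat = optimize_s(p, w, lam_inv, r_inv)
```

Because the step size decays, the loop can stop once updates fall below the tolerance even if the gradient is not exactly zero, so the result should be read as an approximate mode.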

4 EXPERIMENTAL RESULTS 

We conduct experiments on two domain adaptation problems (sentiment
analysis [3] and landmine detection [15]) and one multitask learning
problem (based on a construction of 20-newsgroups previously used for
MTL [13]). The relevant statistics for these data sets are in Table 1.
Note that for both sentiment and 20-newsgroups, we project the data
down to 50 dimensions using PCA. In all cases, we run EM for 20
iterations and choose the iteration for which the likelihood of 10%
held-out training data is maximized.



Table 2: Performance on all tasks by competing models,

Model     …        N=6400   mine     NG
Indp      … 74.5% … 63.6% 75.7% … 67.8% …
Bickel    68.0%    72.5%    55.5%    74.1%
Full      72.2%    80.5%    56.2%    75.8%
Diag      71.9%    80.4%    55.8%    75.3%
Full+X    70.1%    75.9%    55.0%    74.7%
Diag+X    70.1%    75.8%    55.1%    74.6%
Data      70.1%    75.8%    54.9%    72.0%


For all experiments, we compare against the following 
baselines and alternative approaches: 

pool: pool all the data and learn a single model 

indp: train separate models for each domain/task 

feda: the "augment" approach of Daumé III [7]

yaxue: the flexible matrix stick breaking process 
method of Dunson, Xue and Carin [5] 

bickel: the discriminative method of Bickel et al.³

The results for all data sets and all methods are shown 
in Table 2. Here, we also compare all five settings
of the Coalescent model (full covariance and diagonal 
covariance, with and without the data, and then the 
tree derived just by clustering the data). Here, we can 
see that the more complex Coalescent-based models 
tend to outperform the other approaches. 

4.1 DOMAIN ADAPTATION: 
SENTIMENT ANALYSIS 

Our first experiment is on sentiment analysis data 
gathered from Amazon [3]. The task is to predict 
whether a review is positive or negative based on the 
text of the review. There are eight domains in this 
task: apparel (a), books (b), DVD (d), electronics (e), 
kitchen (k), music (m), video (v) and other (o). If we 
cluster these tasks on the basis of the data, we obtain 
the tree shown in Figure 2.

In our first experiment, we treat every domain equally and vary the
amount of data used to learn a model. In Figure 3, we show the results
of the coalescent-based model (with full covariance but without data:
Full), baselines, and comparison methods. As we can see, the
coalescent-based approach dominates, even with very many data points
(6400 per domain). In Table 2, we see that moving from full to
diagonal covariance does not hurt significantly. Adding the data hurts
performance significantly, and brings the performance down to the
level of Data, the model that uses the data-based tree. In comparison
to previously published results on this problem [3], our results are
not quite as good. However, prior results depend on a large amount of
prior knowledge in terms of "pivot features," which our model does not
require, and also begin with a different feature representation.

³ The original method works only for two domains. We
extend it to multiple domains in two ways: first, we do
a one-versus-rest approach; second, we do a one-versus-
one approach. The results presented here are oracle in the
sense that they optimistically choose the better approach
for each data set and each domain.

Table 1: Data set statistics for two DA problems and one MTL problem. The number of training and test
examples are averages across the K tasks and are presented with percentage standard deviation.

Model   Data set                  # Tasks   # Features   # Train     # Test
DA      Sentiment [3]             8         5964         9151±43%    2288±43%
DA      Landmine detection [15]   29        9            409±17%     …
MTL     20-newsgroups [13]        10        925          1127±8%     751±8%

Figure 2: Coalescent tree obtained on sentiment data
just using the data points.


In Figure 4, we show the trees after ten iterations of
EM. We can see a difference between these trees and
the tree built just on the data (cf. Figure 2). For
instance, the data tree thinks that "music" is more
like "appliances" than it is like "DVDs," something
that does not happen in the EM tree.

In the next experiments, we select one task as the "target". We use
6400 examples from all the "source" tasks and vary the amount of
labeled target data. We perform an evaluation on four targets, the
same as those used previously [3]: books, DVD, electronics and
kitchen. These results are shown in Figure 5. Here, we again see that
the coalescent-based approach outperforms the baselines. However, for
many of these per-target results, the feda baseline is consistently
the best alternative. One somewhat surprising result is that adding
more and more target data does not appear to help significantly for
this problem.

Figure 3: Accuracies on sentiment analysis data as
number of data points per domain increases (coal =
Full).

4.2 DOMAIN ADAPTATION: LANDMINE 
DETECTION 

The second domain adaptation task we attempt is landmine detection
[15]. To conserve space, we only present overall results and results
for one subtask: the last one. To uncrowd the figure, we also limit
the baseline models to a subset of approaches; recall that the full
results are shown in Table 2. The landmine results are shown in
Figure 6. Note that the performance measure here is AUC: there are
very few positives in this data (around 5%). Here, we see that on the
target-based evaluation, the coalescent-based approach dominates. For
small amounts of data it performs equivalently to indp, but the gap
increases for more data.

4.3 MULTITASK LEARNING: 
20-NEWSGROUPS 

Our final evaluation is on data drawn from 20-newsgroups. Here, we
construct 10 binary classification problems, each of which is its own
task. We use an identical setup to previous work [13]. As before, we
present overall results and then results for two subtasks. The
subtasks we choose are "Baseball versus Politics" and "IBM Hardware
versus Forsale"; these were chosen as examples of good and bad
transfer from previous studies [13]. Here, we have cut out the pool
baseline because it does not make sense in a pure MTL setting. To
uncrowd the figure, we also limit the baseline models to a subset of
approaches; recall that the full results are shown in Table 2. The
results are in Figure 7. Here, we see that the coalescent-based model
overall outperforms the baselines, and further maintains an advantage
for Baseball-versus-Politics, for which we expect a reasonable amount
of transfer. One significant difference between these results and the
DA results is that on the per-target results, in the DA case, our
model continued to outperform. However, in the MTL case, with enough
labeled target data, the independent classifiers quickly catch up. In
comparison to prior results on this problem [13], our rate of
improvement is roughly comparable.

Figure 4: EM tree on the sentiment data.

Figure 5: Per-target sentiment results.

Figure 6: Landmine detection results.

Figure 7: Results on 20-newsgroups multitask learning problem.
(Panel title: Baseball vs Politics.)

Figure 8: Adding bogus data to sentiment task.

4.4 RESULTS ON NOISY DOMAINS 

One additional question that arises in work related to 
a large number of domains (or tasks) is whether the 
addition of unrelated domains can damage a model. 
In this section, we explore the effect of the addition 
of unrelated domains on our learning algorithms. We 
simulate this on the sentiment data by adding a task 
obtained by scrambling the features of one of the true 
tasks, where we vary the percentage scrambled. 
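One possible reading of the scrambling procedure is sketched below: a chosen fraction of the feature columns is permuted among themselves, with the same permutation applied to every example, and the labels are left untouched. This reading is an assumption (the text does not specify the exact procedure), and the function name `scramble_features` is hypothetical.

```python
import random

def scramble_features(X, fraction, rng):
    """Return a copy of X in which a random `fraction` of the feature
    columns have been permuted among themselves (the same permutation
    for every example); the remaining columns are left in place."""
    d = len(X[0])
    chosen = rng.sample(range(d), round(fraction * d))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    # Column j of the output takes its values from column mapping[j].
    mapping = dict(zip(chosen, shuffled))
    return [[row[mapping.get(j, j)] for j in range(d)] for row in X]

noisy = scramble_features([[1, 2, 3, 4], [5, 6, 7, 8]], 0.5, random.Random(0))
```

Setting `fraction` to 0 returns the task unchanged, while larger values progressively break the alignment between this task's features and those of the real tasks.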

The results are shown in Figure 8. Here, we see that there is a slight
trend toward degradation in performance for the original tasks as the
amount of noise in the new task increases. This is true for all of the
learning algorithms; unfortunately, this includes our own. One would
hope that the model could learn to not share information with this
irrelevant task, but apparently the prior toward short trees is too
strong to overcome the noise. Addressing this remains open.

5 DISCUSSION 

We have presented two models: one for domain adap- 
tation (DA) and one for multitask learning (MTL). 
Inference in our models is based on expectation max- 
imization. We observe significant performance im- 
provements on three very different data sets from our 
models. The only distinction between the models is 
what aspects are shared. We believe this is a reason- 
able way to divide up the DA/MTL landscape. 

Two interesting special cases fall out of our model. First, if we set
$\Lambda = I$ and construct a tree where every node branches directly
from the root, our model is precisely the linear multitask model
proposed by Yu et al. [16]. Second, we consider the fact that a
special case of the coalescent can describe the same distribution as a
Dirichlet process [10]. Through this view, we can see that the
Dirichlet-process-based multitask model of Xue et al. [15] is achieved
as a special case.



There are several ideas in the literature for both DA 
and MTL that are not reflected in our model. An easy 
example is the idea that it should be difficult to build 
a classifier for separating source from target data in a 
DA context pQ. Similar ideas have been exploited in 
discriminative models for domain adaptation [2]. However,
these models are most successful when there is no
labeled target data: a case we have not considered. It 
is an open question to address this in our framework. 